#### 1.

Apply dimensionality reduction techniques to the Big Five and MBTI portions of the data and determine which components explain which personality traits.

Principal Component Analysis (PCA) will be used for the dimensionality reduction in this section.
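As a quick refresher (a self-contained sketch on synthetic data, since the course file is not reproduced here): `prcomp()` with `scale = TRUE` performs PCA on the correlation matrix, so the squared standard deviations it reports are that matrix's eigenvalues.

```r
# PCA on standardized data is an eigendecomposition of the correlation matrix.
set.seed(1)
x <- matrix(rnorm(200), ncol = 4)      # 50 observations, 4 variables
pr <- prcomp(x, scale = TRUE, center = TRUE)
eig <- eigen(cor(x))$values            # eigenvalues of the correlation matrix
all.equal(pr$sdev^2, eig)              # TRUE: sdev^2 are those eigenvalues
```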

rawdata <- read_excel("/Users/lasgalen/Desktop/BDS 516/data in class/personality_test_data.xlsx", col_names = TRUE)
data_clean <- rawdata[-1,-55] # column 55 is the same as column 54, therefore drop column 55
data_clean <- as.data.frame(sapply(data_clean, as.numeric))

From the data, we can see that the first 50 questions come from the Big Five test and the following 70 questions come from the MBTI test.

References: Big Five: https://canvas.upenn.edu/courses/1567977/files/95666049?module_item_id=20534870 MBTI: https://canvas.upenn.edu/courses/1567977/files/95666070?module_item_id=20534871

bigfive <- data_clean[,c(1:50)]
mbti <- data_clean[,c(51:120)]
Big Five
pr.out_bigfive <- prcomp(drop_na(bigfive), scale = TRUE, center = TRUE)
summary(pr.out_bigfive)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.6817 2.3170 2.08142 2.07554 1.75867 1.44765 1.41390
## Proportion of Variance 0.1438 0.1074 0.08665 0.08616 0.06186 0.04191 0.03998
## Cumulative Proportion  0.1438 0.2512 0.33785 0.42401 0.48586 0.52778 0.56776
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.24843 1.19557 1.16879 1.14573 1.04761 1.02021 0.99600
## Proportion of Variance 0.03117 0.02859 0.02732 0.02625 0.02195 0.02082 0.01984
## Cumulative Proportion  0.59893 0.62752 0.65484 0.68109 0.70304 0.72386 0.74370
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.97806 0.92315 0.90128 0.89583 0.85095 0.82386 0.80490
## Proportion of Variance 0.01913 0.01704 0.01625 0.01605 0.01448 0.01357 0.01296
## Cumulative Proportion  0.76283 0.77988 0.79612 0.81217 0.82666 0.84023 0.85319
##                           PC22    PC23    PC24    PC25    PC26    PC27    PC28
## Standard deviation     0.77209 0.74759 0.73366 0.70329 0.66675 0.62017 0.60793
## Proportion of Variance 0.01192 0.01118 0.01077 0.00989 0.00889 0.00769 0.00739
## Cumulative Proportion  0.86511 0.87629 0.88705 0.89695 0.90584 0.91353 0.92092
##                          PC29    PC30    PC31    PC32    PC33    PC34    PC35
## Standard deviation     0.5916 0.57781 0.56853 0.55792 0.51920 0.50044 0.49263
## Proportion of Variance 0.0070 0.00668 0.00646 0.00623 0.00539 0.00501 0.00485
## Cumulative Proportion  0.9279 0.93460 0.94106 0.94729 0.95268 0.95769 0.96254
##                           PC36    PC37    PC38    PC39    PC40    PC41    PC42
## Standard deviation     0.47609 0.44500 0.43142 0.41390 0.40576 0.37883 0.36600
## Proportion of Variance 0.00453 0.00396 0.00372 0.00343 0.00329 0.00287 0.00268
## Cumulative Proportion  0.96707 0.97104 0.97476 0.97818 0.98148 0.98435 0.98703
##                           PC43   PC44    PC45    PC46    PC47    PC48    PC49
## Standard deviation     0.34856 0.3390 0.30428 0.28758 0.26839 0.23885 0.23761
## Proportion of Variance 0.00243 0.0023 0.00185 0.00165 0.00144 0.00114 0.00113
## Cumulative Proportion  0.98946 0.9918 0.99361 0.99526 0.99670 0.99784 0.99897
##                           PC50
## Standard deviation     0.22683
## Proportion of Variance 0.00103
## Cumulative Proportion  1.00000
biplot(pr.out_bigfive)

pr.var_bigfive <- pr.out_bigfive$sdev^2
pve_bigfive <- pr.var_bigfive / sum(pr.var_bigfive)
plot(pve_bigfive, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")

plot(cumsum(pve_bigfive), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")

fviz_eig(pr.out_bigfive)

fviz_pca_var(pr.out_bigfive,
             col.var = "contrib", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE    
             )
## Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

From the PCA results and the variable plots above, Q24, Q34, and Q12 appear to be the most influential questions, while Q32, Q37, Q42, and Q53 also play important roles in determining the personality type.
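Which items drive a component can be read off the loadings matrix `pr.out_bigfive$rotation`; here is a minimal sketch on synthetic data (the Q1 to Q6 names are placeholders, not the actual survey items):

```r
# Rank items by the absolute size of their loading on PC1.
set.seed(2)
toy <- as.data.frame(matrix(rnorm(300), ncol = 6,
                            dimnames = list(NULL, paste0("Q", 1:6))))
pr_toy <- prcomp(toy, scale = TRUE, center = TRUE)
load_pc1 <- pr_toy$rotation[, "PC1"]             # one loading per item
top3 <- names(sort(abs(load_pc1), decreasing = TRUE))[1:3]
top3                                             # the three items driving PC1
```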

MBTI
pr.out_mbti <- prcomp(drop_na(mbti), scale = TRUE, center = TRUE)
summary(pr.out_mbti)
## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.8966 2.24527 1.88836 1.83641 1.79113 1.65255 1.58786
## Proportion of Variance 0.1199 0.07202 0.05094 0.04818 0.04583 0.03901 0.03602
## Cumulative Proportion  0.1199 0.19188 0.24282 0.29100 0.33683 0.37584 0.41186
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.54759 1.50800 1.43688 1.38622 1.33399 1.30985 1.27741
## Proportion of Variance 0.03421 0.03249 0.02949 0.02745 0.02542 0.02451 0.02331
## Cumulative Proportion  0.44608 0.47856 0.50806 0.53551 0.56093 0.58544 0.60875
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     1.24054 1.23067 1.18504 1.14823 1.11837 1.10746 1.08663
## Proportion of Variance 0.02198 0.02164 0.02006 0.01883 0.01787 0.01752 0.01687
## Cumulative Proportion  0.63074 0.65237 0.67244 0.69127 0.70914 0.72666 0.74353
##                           PC22    PC23    PC24   PC25    PC26   PC27    PC28
## Standard deviation     1.06343 1.04221 1.02443 0.9862 0.97246 0.9241 0.92258
## Proportion of Variance 0.01616 0.01552 0.01499 0.0139 0.01351 0.0122 0.01216
## Cumulative Proportion  0.75968 0.77520 0.79019 0.8041 0.81760 0.8298 0.84195
##                           PC29   PC30    PC31    PC32    PC33    PC34    PC35
## Standard deviation     0.89541 0.8776 0.83884 0.81739 0.80859 0.78296 0.77422
## Proportion of Variance 0.01145 0.0110 0.01005 0.00954 0.00934 0.00876 0.00856
## Cumulative Proportion  0.85341 0.8644 0.87446 0.88401 0.89335 0.90210 0.91067
##                           PC36    PC37   PC38    PC39    PC40    PC41    PC42
## Standard deviation     0.73858 0.69638 0.6744 0.66679 0.64162 0.59865 0.59339
## Proportion of Variance 0.00779 0.00693 0.0065 0.00635 0.00588 0.00512 0.00503
## Cumulative Proportion  0.91846 0.92539 0.9319 0.93824 0.94412 0.94924 0.95427
##                           PC43    PC44    PC45    PC46    PC47   PC48    PC49
## Standard deviation     0.57636 0.53445 0.51996 0.49989 0.47643 0.4735 0.45927
## Proportion of Variance 0.00475 0.00408 0.00386 0.00357 0.00324 0.0032 0.00301
## Cumulative Proportion  0.95901 0.96309 0.96696 0.97052 0.97377 0.9770 0.97998
##                           PC50    PC51    PC52   PC53    PC54    PC55    PC56
## Standard deviation     0.44105 0.42140 0.38547 0.3649 0.33534 0.31390 0.30850
## Proportion of Variance 0.00278 0.00254 0.00212 0.0019 0.00161 0.00141 0.00136
## Cumulative Proportion  0.98276 0.98530 0.98742 0.9893 0.99093 0.99234 0.99370
##                           PC57    PC58    PC59    PC60    PC61    PC62    PC63
## Standard deviation     0.29327 0.26975 0.25016 0.23457 0.19745 0.17125 0.16369
## Proportion of Variance 0.00123 0.00104 0.00089 0.00079 0.00056 0.00042 0.00038
## Cumulative Proportion  0.99493 0.99597 0.99686 0.99765 0.99820 0.99862 0.99900
##                           PC64    PC65    PC66    PC67    PC68    PC69    PC70
## Standard deviation     0.13934 0.12981 0.12433 0.09669 0.06997 0.05290 0.03003
## Proportion of Variance 0.00028 0.00024 0.00022 0.00013 0.00007 0.00004 0.00001
## Cumulative Proportion  0.99928 0.99952 0.99974 0.99988 0.99995 0.99999 1.00000
biplot(pr.out_mbti)

pr.var_mbti <- pr.out_mbti$sdev^2
pve_mbti <- pr.var_mbti / sum(pr.var_mbti)
plot(pve_mbti, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")

plot(cumsum(pve_mbti), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")

fviz_eig(pr.out_mbti)

fviz_pca_ind(pr.out_mbti,
             col.ind = "cos2", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE 
             )

fviz_pca_var(pr.out_mbti,
             col.var = "contrib", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE    
             )
## Warning: ggrepel: 38 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

fviz_pca_biplot(pr.out_mbti, repel = TRUE,
                col.var = "#2E9FDF", 
                col.ind = "#696969" 
                )
## Warning: ggrepel: 16 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

From the PCA results and the variable plots above, Q56, Q97, and Q109 appear to be the most influential questions in determining the personality type.

#### 2.

Show whether individuals are clustered according to the Big Five and MBTI questions.

Big Five
#PCA
fviz_pca_ind(pr.out_bigfive,
             col.ind = "cos2", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE 
             )
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

set.seed(123)

# k-means
elbow_method_bigfive <- fviz_nbclust(drop_na(bigfive), FUNcluster = kmeans, method = "wss")
elbow_method_bigfive

silhouette_method_bigfive <- fviz_nbclust(drop_na(bigfive), FUNcluster = kmeans, method = "silhouette")
silhouette_method_bigfive

# The optimal number of clusters is 2
km.out_bigfive <- kmeans(drop_na(bigfive), 2, nstart = 20)
km.out_bigfive
## K-means clustering with 2 clusters of sizes 29, 59
## 
## Cluster means:
##         Q1       Q4       Q6       Q7       Q8       Q9      Q10      Q11
## 1 2.827586 2.034483 3.551724 4.068966 3.413793 2.310345 4.137931 2.724138
## 2 3.000000 1.593220 3.745763 2.728814 3.779661 2.440678 4.593220 2.779661
##        Q12      Q13      Q14      Q15      Q16      Q17      Q18      Q19
## 1 2.172414 2.000000 3.344828 2.310345 4.206897 4.586207 4.137931 2.827586
## 2 3.440678 1.779661 4.203390 1.576271 4.203390 3.694915 4.016949 2.559322
##        Q20      Q21      Q22      Q23      Q24      Q25      Q26      Q27
## 1 4.103448 2.793103 1.965517 1.586207 3.448276 2.000000 2.862069 3.275862
## 2 4.542373 2.000000 3.169492 1.593220 4.203390 2.067797 3.050847 2.423729
##        Q28      Q29      Q30      Q31      Q32      Q33      Q34      Q35
## 1 3.862069 2.206897 3.931034 2.655172 4.034483 1.862069 2.896552 2.206897
## 2 4.033898 1.915254 3.949153 2.457627 1.932203 1.644068 3.457627 1.508475
##        Q36      Q37      Q38      Q39      Q40      Q41      Q42      Q43
## 1 4.068966 3.896552 3.931034 3.482759 3.724138 2.172414 3.482759 3.103448
## 2 3.966102 2.288136 4.406780 3.474576 4.169492 1.677966 1.677966 2.932203
##        Q45      Q46      Q47      Q48      Q49      Q50      Q51      Q52
## 1 2.965517 4.034483 3.724138 3.758621 4.586207 3.137931 3.896552 3.965517
## 2 3.084746 4.372881 3.711864 2.305085 4.406780 3.016949 4.288136 3.830508
##        Q53      Q54
## 1 3.551724 4.034483
## 2 1.796610 4.271186
## 
## Clustering vector:
##  [1] 2 2 2 2 1 2 2 1 2 1 1 1 2 1 2 2 1 2 1 2 1 1 2 2 2 1 1 2 1 2 2 2 2 2 2 2 2 2
## [39] 2 2 1 1 2 2 2 2 1 1 2 2 2 2 2 1 2 1 2 1 2 2 1 2 2 1 2 2 1 2 1 1 2 2 2 2 2 2
## [77] 2 1 2 2 2 1 2 2 2 1 2 1
## 
## Within cluster sum of squares by cluster:
## [1] 1532.069 3109.356
##  (between_SS / total_SS =  10.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
# Hierarchical Clustering
hccomplete_bigfive <- hclust(dist(drop_na(bigfive)), method="complete")
dend_bigfive <- as.dendrogram(hccomplete_bigfive)
dend.bigfive <- color_branches(dend_bigfive, k = 2)
plot(dend_bigfive)

bigfiveclusters <- cutree(hccomplete_bigfive, k = 2)

# Gap Statistic Method
gap_stat <- clusGap(drop_na(bigfive), FUN = kmeans, nstart = 25, K.max = 10, B = 50)
gap_stat_method <- fviz_gap_stat(gap_stat)
gap_stat_method

The individuals plot from the PCA on the Big Five questions shows no clearly separated clusters; however, individuals tend to gather on the negative side of Dim1.

We then chose the number of k-means clusters using the elbow and silhouette methods. While there is no pronounced elbow, the silhouette method suggests that the optimal number of clusters is 2, so we ran k-means with 2 clusters; they have sizes of 29 and 59, respectively.
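For reference, the silhouette criterion behind `fviz_nbclust` averages each point's width s(i) = (b(i) - a(i)) / max(a(i), b(i)); a minimal sketch with `cluster::silhouette()` on two synthetic, well-separated blobs:

```r
library(cluster)                       # silhouette() lives here
set.seed(3)
blobs <- rbind(matrix(rnorm(60), ncol = 2),
               matrix(rnorm(60, mean = 5), ncol = 2))
km <- kmeans(blobs, centers = 2, nstart = 20)
sil <- silhouette(km$cluster, dist(blobs))
avg_width <- mean(sil[, "sil_width"])  # the per-k score fviz_nbclust plots
avg_width                              # high (near 1) for clean separation
```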

After running hierarchical clustering, we colored the dendrogram with two different colors and cut it into two clusters.

The gap statistic method indicates that the optimal number of clusters is one; we suspect this is due to the small sample size.
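For context, `clusGap()` compares log(W_k) for the data against its expectation under a uniform reference distribution (B bootstrap draws); on data with no real structure the criterion typically lands on k = 1, consistent with the result above. A small sketch on synthetic, unclustered data:

```r
library(cluster)
set.seed(4)
noise <- matrix(runif(200), ncol = 2)            # no cluster structure
gs <- clusGap(noise, FUN = kmeans, nstart = 25, K.max = 5, B = 20)
best_k <- maxSE(gs$Tab[, "gap"], gs$Tab[, "SE.sim"],
                method = "firstSEmax")           # standard 1-SE selection rule
best_k
```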

MBTI
#PCA
fviz_pca_ind(pr.out_mbti,
             col.ind = "cos2", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE 
             )

# k-means
elbow_method_mbti <- fviz_nbclust(drop_na(mbti), FUNcluster = kmeans, method = "wss")
elbow_method_mbti

silhouette_method_mbti <- fviz_nbclust(drop_na(mbti), FUNcluster = kmeans, method = "silhouette")
silhouette_method_mbti

# The optimal number of clusters is 2
km.out_mbti <- kmeans(drop_na(mbti), 2, nstart = 20)
km.out_mbti
## K-means clustering with 2 clusters of sizes 27, 50
## 
## Cluster means:
##       Q55     Q56      Q57      Q58      Q60      Q61      Q62      Q63
## 1 1.37037 1.62963 1.888889 1.444444 1.296296 1.222222 1.481481 1.296296
## 2 1.70000 1.14000 1.720000 1.140000 1.320000 1.260000 1.180000 1.520000
##        Q64      Q65      Q66      Q67      Q68      Q69     Q70      Q71
## 1 1.592593 1.777778 1.888889 1.851852 1.555556 1.111111 1.37037 1.888889
## 2 1.420000 1.740000 1.760000 1.480000 1.340000 1.000000 1.46000 1.560000
##       Q72      Q73      Q74      Q75      Q76      Q77      Q78      Q79
## 1 1.62963 1.666667 1.666667 1.148148 1.740741 1.407407 1.592593 1.851852
## 2 1.44000 1.540000 1.180000 1.040000 1.460000 1.480000 1.460000 1.700000
##        Q80      Q81      Q82      Q83      Q84     Q85      Q86      Q87
## 1 1.666667 1.407407 1.444444 1.777778 1.111111 1.62963 1.777778 1.666667
## 2 1.480000 1.280000 1.080000 1.780000 1.420000 1.54000 1.800000 1.200000
##        Q88      Q89      Q90      Q91      Q92      Q93      Q94      Q95
## 1 1.814815 1.962963 1.888889 1.074074 1.481481 1.703704 1.222222 1.518519
## 2 1.500000 1.740000 1.820000 1.340000 1.160000 1.520000 1.320000 1.260000
##        Q96      Q97      Q98      Q99     Q100     Q101     Q102     Q103
## 1 1.518519 1.777778 1.962963 1.666667 1.925926 1.444444 1.592593 1.851852
## 2 1.320000 1.240000 1.880000 1.440000 1.800000 1.200000 1.660000 1.460000
##       Q104     Q105     Q106     Q107    Q108    Q109    Q110     Q111     Q112
## 1 1.185185 1.222222 1.555556 1.518519 1.37037 1.62963 1.62963 1.481481 1.777778
## 2 1.060000 1.680000 1.200000 1.260000 1.10000 1.44000 1.14000 1.260000 1.600000
##       Q113     Q114     Q115     Q116     Q117     Q118     Q119     Q120
## 1 1.592593 1.592593 1.666667 1.777778 1.518519 1.666667 1.148148 1.703704
## 2 1.180000 1.160000 1.400000 1.640000 1.160000 1.220000 1.320000 1.360000
##       Q121     Q122     Q123     Q124     Q125
## 1 1.777778 1.518519 1.444444 1.666667 1.740741
## 2 1.640000 1.260000 1.280000 1.240000 1.120000
## 
## Clustering vector:
##  [1] 1 2 1 1 2 1 2 2 2 1 2 2 2 1 1 2 2 2 2 1 2 2 2 2 1 2 1 1 2 2 1 1 2 2 1 2 2 2
## [39] 2 2 1 1 2 1 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 1 2 2 2 1 1 1 1 2 2 1 2 1 2 1
## [77] 2
## 
## Within cluster sum of squares by cluster:
## [1] 364.2963 664.0000
##  (between_SS / total_SS =   8.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
# Hierarchical Clustering
hccomplete_mbti <- hclust(dist(drop_na(mbti)), method="complete")
dend_mbti <- as.dendrogram(hccomplete_mbti)
dend.mbti <- color_branches(dend_mbti, k = 2)
plot(dend_mbti)

mbticlusters <- cutree(hccomplete_mbti, k = 2)

# Gap Statistic Method
gap_stat <- clusGap(drop_na(mbti), FUN = kmeans, nstart = 25, K.max = 10, B = 50)
gap_stat_method <- fviz_gap_stat(gap_stat)
gap_stat_method

The individuals plot from the PCA on the MBTI questions likewise shows no clearly separated clusters; however, individuals tend to gather in the middle (low cos2 values, meaning those individuals are poorly represented by Dim1 and Dim2).

We again chose the number of k-means clusters using the elbow and silhouette methods. While there is again no pronounced elbow, the silhouette method again suggests that the optimal number of clusters is 2, so we ran k-means with 2 clusters; they have sizes of 27 and 50, respectively.

After running hierarchical clustering, we colored the dendrogram with two different colors and cut it into two clusters.

The gap statistic method again indicates that the optimal number of clusters is one; we suspect this is due to the small sample size.

#### 3.

Compare the results of the dimensionality reduction of Big 5 and MBTI.

fviz_pca_biplot(pr.out_bigfive, repel = TRUE,
                col.var = "#2E9FDF", 
                col.ind = "#696969", 
                title = "PCA Biplot for Big Five"
                )
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

fviz_pca_biplot(pr.out_mbti, repel = TRUE,
                col.var = "#2E9FDF", 
                col.ind = "#696969",
                title = "PCA Biplot for MBTI"
                )
## Warning: ggrepel: 16 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

From the PCA biplots of the Big Five and MBTI tests, most Big Five variables show large loadings on Dim1 in both directions (positive and negative), whereas the MBTI variables mainly load on the negative side of Dim1.

Also, the individuals in the MBTI test are relatively more dispersed than those in the Big Five test.

PCA is particularly useful when the variables in a data set are highly correlated: correlation indicates redundancy, which PCA exploits to compress the original variables into a smaller number of new variables (principal components) that explain most of the variance. Given the patterns shown above, the Big Five data may carry more redundancy.
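The redundancy argument can be made concrete with synthetic data: four noisy copies of one latent trait give a dominant PC1, while four independent variables spread the variance almost evenly. A sketch:

```r
set.seed(5)
n <- 200
trait <- rnorm(n)
redundant   <- sapply(1:4, function(i) trait + rnorm(n, sd = 0.3))
independent <- matrix(rnorm(n * 4), ncol = 4)
pve1 <- function(m) {                  # proportion of variance on PC1
  p <- prcomp(m, scale = TRUE)
  p$sdev[1]^2 / sum(p$sdev^2)
}
pve1(redundant)    # close to 1: one component captures the shared trait
pve1(independent)  # close to 1/4: nothing to compress
```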

#### 4.

Check to see if there is correspondence between MBTI and Big Five items.

data_clean3 <- scale(drop_na(data_clean))

pr.out_data_clean3 <- prcomp(data_clean3, scale = TRUE, center = TRUE)
summary(pr.out_data_clean3)
## Importance of components:
##                            PC1     PC2     PC3     PC4     PC5     PC6    PC7
## Standard deviation     3.46362 2.90639 2.88738 2.69513 2.18728 1.94777 1.8909
## Proportion of Variance 0.09997 0.07039 0.06947 0.06053 0.03987 0.03162 0.0298
## Cumulative Proportion  0.09997 0.17036 0.23984 0.30037 0.34024 0.37185 0.4017
##                            PC8     PC9    PC10   PC11    PC12    PC13    PC14
## Standard deviation     1.85466 1.82770 1.78015 1.7076 1.67318 1.61901 1.59684
## Proportion of Variance 0.02866 0.02784 0.02641 0.0243 0.02333 0.02184 0.02125
## Cumulative Proportion  0.43031 0.45815 0.48456 0.5089 0.53219 0.55403 0.57528
##                           PC15    PC16    PC17  PC18    PC19    PC20    PC21
## Standard deviation     1.55494 1.52513 1.47649 1.428 1.36780 1.34882 1.34635
## Proportion of Variance 0.02015 0.01938 0.01817 0.017 0.01559 0.01516 0.01511
## Cumulative Proportion  0.59543 0.61481 0.63298 0.650 0.66557 0.68073 0.69584
##                           PC22    PC23    PC24    PC25    PC26    PC27    PC28
## Standard deviation     1.33381 1.32231 1.27530 1.25594 1.22916 1.19432 1.17149
## Proportion of Variance 0.01483 0.01457 0.01355 0.01314 0.01259 0.01189 0.01144
## Cumulative Proportion  0.71066 0.72523 0.73879 0.75193 0.76452 0.77641 0.78784
##                           PC29    PC30    PC31    PC32    PC33    PC34    PC35
## Standard deviation     1.16421 1.14647 1.12819 1.09700 1.06475 1.04389 1.01927
## Proportion of Variance 0.01129 0.01095 0.01061 0.01003 0.00945 0.00908 0.00866
## Cumulative Proportion  0.79914 0.81009 0.82070 0.83073 0.84017 0.84925 0.85791
##                          PC36   PC37    PC38    PC39    PC40    PC41    PC42
## Standard deviation     1.0043 0.9918 0.96271 0.95582 0.93475 0.90534 0.86840
## Proportion of Variance 0.0084 0.0082 0.00772 0.00761 0.00728 0.00683 0.00628
## Cumulative Proportion  0.8663 0.8745 0.88224 0.88985 0.89713 0.90396 0.91025
##                           PC43    PC44    PC45    PC46    PC47    PC48   PC49
## Standard deviation     0.84643 0.83824 0.82126 0.77233 0.76760 0.73332 0.7182
## Proportion of Variance 0.00597 0.00586 0.00562 0.00497 0.00491 0.00448 0.0043
## Cumulative Proportion  0.91622 0.92207 0.92769 0.93266 0.93757 0.94206 0.9464
##                           PC50    PC51    PC52    PC53   PC54    PC55    PC56
## Standard deviation     0.69091 0.67419 0.67151 0.64043 0.6295 0.61496 0.60700
## Proportion of Variance 0.00398 0.00379 0.00376 0.00342 0.0033 0.00315 0.00307
## Cumulative Proportion  0.95033 0.95412 0.95788 0.96130 0.9646 0.96775 0.97082
##                           PC57    PC58   PC59    PC60    PC61    PC62    PC63
## Standard deviation     0.56752 0.56432 0.5591 0.53054 0.50725 0.48691 0.47846
## Proportion of Variance 0.00268 0.00265 0.0026 0.00235 0.00214 0.00198 0.00191
## Cumulative Proportion  0.97350 0.97616 0.9788 0.98111 0.98325 0.98523 0.98714
##                           PC64    PC65    PC66    PC67    PC68    PC69    PC70
## Standard deviation     0.46689 0.43168 0.42076 0.40039 0.37667 0.36031 0.35412
## Proportion of Variance 0.00182 0.00155 0.00148 0.00134 0.00118 0.00108 0.00105
## Cumulative Proportion  0.98895 0.99051 0.99198 0.99332 0.99450 0.99558 0.99663
##                           PC71    PC72    PC73    PC74    PC75      PC76
## Standard deviation     0.33075 0.31150 0.29128 0.24175 0.23497 1.538e-15
## Proportion of Variance 0.00091 0.00081 0.00071 0.00049 0.00046 0.000e+00
## Cumulative Proportion  0.99754 0.99835 0.99905 0.99954 1.00000 1.000e+00
biplot(pr.out_data_clean3)

pr.var_data_clean3 <- pr.out_data_clean3$sdev^2
pve_data_clean3 <- pr.var_data_clean3 / sum(pr.var_data_clean3)
plot(pve_data_clean3, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")

plot(cumsum(pve_data_clean3), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")

fviz_eig(pr.out_data_clean3)

fviz_pca_var(pr.out_data_clean3,
             col.var = "contrib", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE    
             )
## Warning: ggrepel: 85 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Question 35 (“I am not really interested in others”) and question 55 (whether it is easy to speak to strangers) sit very close to each other, and the questions around them are also about how comfortable one is around other people. After investigating their wording, we found that both relate to extraversion/introversion.

Questions 20, 30, and 40 load substantially in the same direction; from their wording, they seem to measure “agreeableness” (the degree to which an individual adjusts their behavior to suit others). The same quadrant also contains questions 73 and 80 from the MBTI, which similarly measure how much one values harmonious relationships. We therefore interpret this quadrant as capturing “agreeableness.”
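Correspondence between individual items can also be checked with plain correlations rather than loadings. A hypothetical sketch (the Q35/Q55 columns below are simulated stand-ins for the real survey columns, both driven by one latent introversion score):

```r
set.seed(6)
n <- 80
introversion <- rnorm(n)                        # latent trait (simulated)
toy <- data.frame(Q35 = introversion + rnorm(n, sd = 0.5),  # Big Five item
                  Q55 = introversion + rnorm(n, sd = 0.5))  # MBTI item
cor(toy$Q35, toy$Q55)                           # strongly positive if items align
```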

#### 5.

Predict my MBTI personality using an appropriate method and compare it with my actual score.

data_clean <- drop_na(data_clean)

elbow_method <- fviz_nbclust(data_clean, FUNcluster = kmeans, method = "wss")
elbow_method

silhouette_method <- fviz_nbclust(data_clean, FUNcluster = kmeans, method = "silhouette")
silhouette_method

According to the elbow and silhouette analysis, we can see that the optimal number of clusters is 3.

# The optimal number of clusters is 3
# Use Hierarchical Clustering to identify the clusters
hccomplete <- hclust(dist(drop_na(data_clean)), method="complete")
dend <- as.dendrogram(hccomplete)
dend <- color_branches(dend, k = 3)
plot(dend)

clusters <- cutree(hccomplete, k = 3)

data_clean %>% mutate(cluster = clusters) %>% 
  group_by(cluster) %>% 
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))
## # A tibble: 3 x 121
##   cluster    Q1    Q4    Q6    Q7    Q8    Q9   Q10   Q11   Q12   Q13   Q14
##     <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1       1  2.82  1.76  3.87  2.8   3.78  2.51  4.58  2.58  3.29  1.87  4.27
## 2       2  2.92  2     3.62  4.12  3.65  2.15  4.12  2.58  2.19  1.92  3.46
## 3       3  3.8   1.2   2     2.4   4.6   3     4.8   4.6   3.8   1.6   4   
## # … with 109 more variables: Q15 <dbl>, Q16 <dbl>, Q17 <dbl>, Q18 <dbl>,
## #   Q19 <dbl>, Q20 <dbl>, Q21 <dbl>, Q22 <dbl>, Q23 <dbl>, Q24 <dbl>,
## #   Q25 <dbl>, Q26 <dbl>, Q27 <dbl>, Q28 <dbl>, Q29 <dbl>, Q30 <dbl>,
## #   Q31 <dbl>, Q32 <dbl>, Q33 <dbl>, Q34 <dbl>, Q35 <dbl>, Q36 <dbl>,
## #   Q37 <dbl>, Q38 <dbl>, Q39 <dbl>, Q40 <dbl>, Q41 <dbl>, Q42 <dbl>,
## #   Q43 <dbl>, Q45 <dbl>, Q46 <dbl>, Q47 <dbl>, Q48 <dbl>, Q49 <dbl>,
## #   Q50 <dbl>, Q51 <dbl>, Q52 <dbl>, Q53 <dbl>, Q54 <dbl>, Q55 <dbl>,
## #   Q56 <dbl>, Q57 <dbl>, Q58 <dbl>, Q60 <dbl>, Q61 <dbl>, Q62 <dbl>,
## #   Q63 <dbl>, Q64 <dbl>, Q65 <dbl>, Q66 <dbl>, Q67 <dbl>, Q68 <dbl>,
## #   Q69 <dbl>, Q70 <dbl>, Q71 <dbl>, Q72 <dbl>, Q73 <dbl>, Q74 <dbl>,
## #   Q75 <dbl>, Q76 <dbl>, Q77 <dbl>, Q78 <dbl>, Q79 <dbl>, Q80 <dbl>,
## #   Q81 <dbl>, Q82 <dbl>, Q83 <dbl>, Q84 <dbl>, Q85 <dbl>, Q86 <dbl>,
## #   Q87 <dbl>, Q88 <dbl>, Q89 <dbl>, Q90 <dbl>, Q91 <dbl>, Q92 <dbl>,
## #   Q93 <dbl>, Q94 <dbl>, Q95 <dbl>, Q96 <dbl>, Q97 <dbl>, Q98 <dbl>,
## #   Q99 <dbl>, Q100 <dbl>, Q101 <dbl>, Q102 <dbl>, Q103 <dbl>, Q104 <dbl>,
## #   Q105 <dbl>, Q106 <dbl>, Q107 <dbl>, Q108 <dbl>, Q109 <dbl>, Q110 <dbl>,
## #   Q111 <dbl>, Q112 <dbl>, Q113 <dbl>, Q114 <dbl>, Q115 <dbl>, Q116 <dbl>, …

Checking Alex's answers to Q35, Q105, Q20, Q30, Q40, Q73, and Q80, we find that overall they are closest to the average scores of cluster 3 (his answers to Q35 and Q80 are closer to cluster 2, those to Q105, Q20, Q30, and Q40 are closer to cluster 3, and his answer to Q73 is closer to cluster 1).

Q35 and Q105 identify introversion/extraversion; based on the average scores on these two questions, we can infer that people in cluster 2 tend to be introverted. Q20, Q30, and Q40 measure “agreeableness” (the degree to which an individual adjusts their behavior to suit others); people in cluster 3 tend to be more agreeable and soft-hearted.

Q73 and Q80 measure how much one values harmonious relationships; people in cluster 1 tend to value them highly.

From the Big Five and MBTI results together, we can infer that Alex is on the more introverted side and is an agreeable person; he may also value harmonious social relationships.
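The informal comparison above can be formalized as nearest-centroid assignment: compute the Euclidean distance from the respondent's answers to each cluster's mean profile and pick the closest. A sketch with toy numbers (illustrative, not Alex's actual answers or the real cluster means):

```r
# Rows = cluster mean profiles, columns = selected questions.
centroids <- rbind(c1 = c(2, 4, 3),
                   c2 = c(4, 2, 1),
                   c3 = c(3, 3, 5))
respondent <- c(3, 3, 5)                 # the new person's answers
d <- apply(centroids, 1, function(m) sqrt(sum((m - respondent)^2)))
names(which.min(d))                      # -> "c3": the closest mean profile
```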